cannot-link constraint
A Scalable Global Optimization Algorithm For Constrained Clustering
Chumpitaz-Flores, Pedro, Duong, My, Heredia, Cristobal, Hua, Kaixun
Constrained clustering leverages limited domain knowledge to improve clustering performance and interpretability, but incorporating pairwise must-link and cannot-link constraints is an NP-hard challenge, making global optimization intractable. Existing mixed-integer optimization methods are confined to small-scale datasets, limiting their utility. We propose Sample-Driven Constrained Group-Based Branch-and-Bound (SDC-GBB), a decomposable branch-and-bound (BB) framework that collapses must-linked samples into centroid-based pseudo-samples and prunes cannot-link through geometric rules, while preserving convergence and guaranteeing global optimality. By integrating grouped-sample Lagrangian decomposition and geometric elimination rules for efficient lower and upper bounds, the algorithm attains highly scalable pairwise k-Means constrained clustering via parallelism. Experimental results show that our approach handles datasets with 200,000 samples with cannot-link constraints and 1,500,000 samples with must-link constraints, which is 200 - 1500 times larger than the current state-of-the-art under comparable constraint settings, while reaching an optimality gap of less than 3%. In providing deterministic global guarantees, our method also avoids the search failures that off-the-shelf heuristics often encounter on large datasets.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Florida (0.04)
- North America > Costa Rica > Heredia Province > Heredia (0.04)
- Europe > Greece > Central Macedonia > Thessaloniki (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.88)
Data Skeleton Learning: Scalable Active Clustering with Sparse Graph Structures
Xie, Wen-Bo, Fu, Xun, Chen, Bin, Lee, Yan-Li, Deng, Tao, Zou, Tian, Wang, Xin, Liu, Zhen, Srivastavad, Jaideep
In this work, we focus on the efficiency and scalability of pairwise constraint-based active clustering, crucial for processing large-scale data in applications such as data mining, knowledge annotation, and AI model pre-training. Our goals are threefold: (1) to reduce computational costs for iterative clustering updates; (2) to enhance the impact of user-provided constraints to minimize annotation requirements for precise clustering; and (3) to cut down memory usage in practical deployments. To achieve these aims, we propose a graph-based active clustering algorithm that utilizes two sparse graphs: one for representing relationships between data (our proposed data skeleton) and another for updating this data skeleton. These two graphs work in concert, enabling the refinement of connected subgraphs within the data skeleton to create nested clusters. Our empirical analysis confirms that the proposed algorithm consistently facilitates more accurate clustering with dramatically less input of user-provided constraints, and outperforms its counterparts in terms of computational performance and scalability, while maintaining robustness across various distance metrics.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
- Asia > China > Sichuan Province > Chengdu (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (3 more...)
Exact and Heuristic Algorithms for Constrained Biclustering
Biclustering, also known as co-clustering or two-way clustering, simultaneously partitions the rows and columns of a data matrix to reveal submatrices with coherent patterns. Incorporating background knowledge into clustering to enhance solution quality and interpretability has attracted growing interest in mathematical optimization and machine learning research. Extending this paradigm to biclustering enables prior information to guide the joint grouping of rows and columns. We study constrained biclustering with pairwise constraints, namely must-link and cannot-link constraints, which specify whether objects should belong to the same or different biclusters. As a model problem, we address the constrained version of the k-densest disjoint biclique problem, which aims to identify k disjoint complete bipartite subgraphs (called bicliques) in a weighted complete bipartite graph, maximizing the total density while satisfying pairwise constraints. We propose both exact and heuristic algorithms. The exact approach is a tailored branch-and-cut algorithm based on a low-dimensional semidefinite programming (SDP) relaxation, strengthened with valid inequalities and solved in a cutting-plane fashion. Exploiting integer programming tools, a rounding scheme converts SDP solutions into feasible biclusterings at each node. For large-scale instances, we introduce an efficient heuristic based on the low-rank factorization of the SDP. The resulting nonlinear optimization problem is tackled with an augmented Lagrangian method, where the subproblem is solved by decomposition through a block-coordinate projected gradient algorithm. Extensive experiments on synthetic and real-world datasets show that the exact method significantly outperforms general-purpose solvers, while the heuristic achieves high-quality solutions efficiently on large instances.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.92)
PCCC: The Pairwise-Confidence-Constraints-Clustering Algorithm
Baumann, Philipp, Hochbaum, Dorit S.
We consider a semi-supervised $k$-clustering problem where information is available on whether pairs of objects are in the same or in different clusters. This information is either available with certainty or with a limited level of confidence. We introduce the PCCC algorithm, which iteratively assigns objects to clusters while accounting for the information provided on the pairs of objects. Our algorithm can include relationships as hard constraints that are guaranteed to be satisfied or as soft constraints that can be violated subject to a penalty. This flexibility distinguishes our algorithm from the state-of-the-art in which all pairwise constraints are either considered hard, or all are considered soft. Unlike existing algorithms, our algorithm scales to large-scale instances with up to 60,000 objects, 100 clusters, and millions of cannot-link constraints (which are the most challenging constraints to incorporate). We compare the PCCC algorithm with state-of-the-art approaches in an extensive computational study. Even though the PCCC algorithm is more general than the state-of-the-art approaches in its applicability, it outperforms the state-of-the-art approaches on instances with all hard constraints or all soft constraints both in terms of running time and various metrics of solution quality. The source code of the PCCC algorithm is publicly available on GitHub.
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > California > Alameda County > Oakland (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Europe > Switzerland > Bern > Bern (0.04)
An Exact Algorithm for Semi-supervised Minimum Sum-of-Squares Clustering
Piccialli, Veronica, Russo, Anna Russo, Sudoso, Antonio M.
The minimum sum-of-squares clustering (MSSC), or k-means type clustering, is traditionally considered an unsupervised learning task. In recent years, the use of background knowledge to improve the cluster quality and promote interpretability of the clustering process has become a hot research topic at the intersection of mathematical optimization and machine learning research. The problem of taking advantage of background information in data clustering is called semi-supervised or constrained clustering. In this paper, we present a branch-and-cut algorithm for semi-supervised MSSC, where background knowledge is incorporated as pairwise must-link and cannot-link constraints. For the lower bound procedure, we solve the semidefinite programming relaxation of the MSSC discrete optimization model, and we use a cutting-plane procedure for strengthening the bound. For the upper bound, instead, by using integer programming tools, we use an adaptation of the k-means algorithm to the constrained case. For the first time, the proposed global optimization algorithm efficiently manages to solve real-world instances up to 800 data points with different combinations of must-link and cannot-link constraints and with a generic number of features. This problem size is about four times larger than the one of the instances solved by state-of-the-art exact algorithms.
- Europe > Italy (0.04)
- North America > United States > California > Alameda County > Oakland (0.04)
Expert-driven Trace Clustering with Instance-level Constraints
De Koninck, Pieter, Nelissen, Klaas, Broucke, Seppe vanden, Baesens, Bart, Snoeck, Monique, De Weerdt, Jochen
Within the field of process mining, several different trace clustering approaches exist for partitioning traces or process instances into similar groups. Typically, this partitioning is based on certain patterns or similarity between the traces, or driven by the discovery of a process model for each cluster. The main drawback of these techniques, however, is that their solutions are usually hard to evaluate or justify by domain experts. In this paper, we present two constrained trace clustering techniques that are capable to leverage expert knowledge in the form of instance-level constraints. In an extensive experimental evaluation using two real-life datasets, we show that our novel techniques are indeed capable of producing clustering solutions that are more justifiable without a substantial negative impact on their quality.
- Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.05)
- Europe > United Kingdom > England > Hampshire > Southampton (0.04)
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
- (5 more...)
A semi-supervised sparse K-Means algorithm
Vouros, Avgoustinos, Vasilaki, Eleni
We consider the problem of data clustering with unidentified feature quality but the existence of small amount of label data. In the first case a sparse clustering method can be employed in order to detect the subgroup of features necessary for clustering and in the second case a semi-supervised method can use the labelled data to create constraints and enhance the clustering solution. In this paper we propose a K-Means inspired algorithm that employs these techniques. We show that the algorithm maintains the high performance of other similar semi-supervised algorthms as well as keeping the ability to identify informative from uninformative features. We examine the performance of the algorithm on real world data sets with unknown features quality as well as a real world data set with a known uninformative feature. We use a series of scenarios with different number and types of constraints.
- Europe > United Kingdom (0.14)
- South America > Paraguay > Asunción > Asunción (0.04)
- Asia > Middle East > Jordan (0.04)
A Classification-Based Approach to Semi-Supervised Clustering with Pairwise Constraints
Śmieja, Marek, Struski, Łukasz, Figueiredo, Mário A. T.
A Classification-Based Approach to Semi-Supervised Clustering with Pairwise Constraints Marek Smieja a,, Łukasz Struski a, Mário A. T. Figueiredo b a Faculty of Mathematics and Computer Science, Jagiellonian University, Kraków, Poland b Instituto de T elecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, PortugalAbstract In this paper, we introduce a neural network framework for semi-supervised clustering (SSC) with pairwise (must-link or cannot-link) constraints. In contrast to existing approaches, we decompose SSC into two simpler classification tasks/stages: the first stage uses a pair of Siamese neural networks to label the unlabeled pairs of points as must-link or cannot-link; the second stage uses the fully pairwise-labeled dataset produced by the first stage in a supervised neural-network-based clustering method. The proposed approach, S 3 C 2 (Semi-Supervised Siamese C lassifiers for C lustering), is motivated by the observation that binary classification (such as assigning pairwise relations) is usually easier than multi-class clustering with partial supervision. On the other hand, being classification-based, our method solves only well-defined classification problems, rather than less well specified clustering tasks. Extensive experiments on various datasets demonstrate the high performance of the proposed method. Keywords: semi-supervised clustering, deep learning, neural networks, pairwise constraints 1. Introduction Clustering is an important unsupervised learning tool often used to analyze the structure of complex high-dimensional data. Semi-supervised clustering (SSC) methods tackle this issue by leveraging partial prior information about class labels, with the goal of obtaining partitions that are better aligned with true classes [1, 2, 3, 4, 5, 6]. One typical way of injecting class label information into clustering is in the form of pairwise constraints (typically, must-link and cannot-link constraints), or pairwise preferences (e.g., should-link and shouldn't-link), which indicate whether a given pair of points is believed to belong to the same or different classes. Most SSC approaches rely on adapting existing unsupervised clustering methods to handle partial (namely, pairwise) information [7, 8, 4, 5, 6, 9]. This requires transferring class-label knowledge into a clustering algorithm, which is often unnatural and puts a higher weight on clustering structure than on class labels.
- Europe > Portugal > Lisbon > Lisbon (0.54)
- Europe > Poland > Lesser Poland Province > Kraków (0.24)
- North America > United States > District of Columbia > Washington (0.04)
- Asia > Middle East > Jordan (0.04)
Constrained K-means with General Pairwise and Cardinality Constraints
Bibi, Adel, Wu, Baoyuan, Ghanem, Bernard
In this work, we study constrained clustering, where constraints are utilized to guide the clustering process. In existing works, two categories of constraints have been widely explored, namely pairwise and cardinality constraints. Pairwise constraints enforce the cluster labels of two instances to be the same (must-link constraints) or different (cannot-link constraints). Cardinality constraints encourage cluster sizes to satisfy a user-specified distribution. However, most existing constrained clustering models can only utilize one category of constraints at a time. In this paper, we enforce the above two categories into a unified clustering model starting with the integer program formulation of the standard K-means. As these two categories provide useful information at different levels, utilizing both of them is expected to allow for better clustering performance. However, the optimization is difficult due to the binary and quadratic constraints in the proposed unified formulation. To alleviate this difficulty, we utilize two techniques: equivalently replacing the binary constraints by the intersection of two continuous constraints; the other is transforming the quadratic constraints into bi-linear constraints by introducing extra variables. Then we derive an equivalent continuous reformulation with simple constraints, which can be efficiently solved by Alternating Direction Method of Multipliers (ADMM) algorithm. Extensive experiments on both synthetic and real data demonstrate: (1) when utilizing a single category of constraint, the proposed model is superior to or competitive with state-of-the-art constrained clustering models, and (2) when utilizing both categories of constraints jointly, the proposed model shows better performance than the case of the single category.
- Asia > Middle East > Saudi Arabia > Mecca Province > Thuwal (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China (0.04)
PIVETed-Granite: Computational Phenotypes through Constrained Tensor Factorization
Henderson, Jette, Malin, Bradley A., Ho, Joyce C., Ghosh, Joydeep
It has been recently shown that sparse, nonnegative tensor factorization of multi-modal electronic health record data is a promising approach to high-throughput computational phenotyping. However, such approaches typically do not leverage available domain knowledge while extracting the phenotypes; hence, some of the suggested phenotypes may not map well to clinical concepts or may be very similar to other suggested phenotypes. To address these issues, we present a novel, automatic approach called PIVETed-Granite that mines existing biomedical literature (PubMed) to obtain cannot-link constraints that are then used as side-information during a tensor-factorization based computational phenotyping process. The resulting improvements are clearly observed in experiments using a large dataset from VUMC to identify phenotypes for hypertensive patients.
- Europe > United Kingdom > England > Greater London > London (0.05)
- North America > United States > Texas > Travis County > Austin (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)